I. INTRODUCTION

Consider the multiple linear regression model

$$y = X\beta + \varepsilon$$

where

$y$ is an n x 1 vector of observations,

$X$ is an n x p matrix of full rank,

$\beta$ is a p x 1 vector of unknown parameters,

and $\varepsilon$ is an n x 1 vector of errors.

In addition to the usual assumptions that $E(\varepsilon) = 0$ and $V(\varepsilon) = \sigma^2 I$, it will also be assumed from the outset that $\varepsilon$ is Normally distributed.

The most common estimator of $\beta$ is the ordinary least squares (OLS) estimator $\hat{\beta}$, which is given by

$$\hat{\beta} = (X'X)^{-1}X'y.$$

Under the stated assumptions $\hat{\beta}$ has the following five properties:

(1) $\hat{\beta}$ minimizes $\varepsilon'\varepsilon = (y - X\beta)'(y - X\beta)$. That is, $\hat{\beta}$ is the value of $\beta$ which minimizes the error sum of squares (the sum of the squared deviations of the observations from their expected values).

(2) $\hat{\beta}$ is the best linear unbiased estimator of $\beta$. That is, of all linear unbiased estimators, $\hat{\beta}$ is the one with minimum variance.
(3) $\hat{\beta}$ is the maximum likelihood estimator of $\beta$.

(4) $\hat{\beta}$ is Normally distributed with mean $\beta$ and covariance matrix $\sigma^2 (X'X)^{-1}$.

(5) The mean square error (MSE) of $\hat{\beta}$ is

$$\mathrm{MSE}(\hat{\beta}) = E[(\hat{\beta} - \beta)'(\hat{\beta} - \beta)] = \sigma^2 \sum_{i=1}^{p} (1/\lambda_i),$$

where the $\lambda_i$ are the eigenvalues of $X'X$.

It should be noted that if $X'X$ has any very small eigenvalues, $\mathrm{MSE}(\hat{\beta})$ will be extremely large. Thus, because $\mathrm{MSE}(\hat{\beta})$ is equivalent to the expected squared distance between $\hat{\beta}$ and $\beta$, the OLS estimator is likely to produce estimates which are quite far away from $\beta$. Small eigenvalues will result, for example, when the $X$ matrix is highly collinear, that is, when one column is close to being a linear combination of other columns.
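The effect of a small eigenvalue on property (5) can be sketched numerically. The snippet below is only an illustration under an assumed setup that does not come from the report: two regressors in correlation form, so that $X'X$ has eigenvalues $1 + r$ and $1 - r$, with $\sigma^2 = 1$. The function names are hypothetical.

```python
def ols_total_mse(eigenvalues, sigma2=1.0):
    """MSE of the OLS estimator from property (5): sigma^2 * sum(1 / lambda_i)."""
    return sigma2 * sum(1.0 / lam for lam in eigenvalues)


def correlation_eigenvalues(r):
    """Eigenvalues of the assumed 2 x 2 correlation-form X'X = [[1, r], [r, 1]]."""
    return (1.0 + r, 1.0 - r)


# As the correlation r approaches 1, the smaller eigenvalue approaches 0
# and the MSE of the OLS estimator grows without bound.
for r in (0.0, 0.9, 0.99, 0.999):
    print(r, ols_total_mse(correlation_eigenvalues(r)))
```

With orthogonal regressors (r = 0) the total is 2 in units of $\sigma^2$; the same quantity grows rapidly as r approaches 1.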
II. RIDGE REGRESSION


With  the  objective  of  reducing  MSE,  Hoerl  and  Kennard  [7,8] 
suggested  the  use  of  what  they  termed  "ridge  regression",  so  named 
because  of  its  relationship  to  ridge  analysis  [3,6],  a technique 
used  in  investigating  fitted  quadratic  response  functions.  Ridge 
regression involves the use of an estimator which depends on the
choice of a number $k \geq 0$. This ridge estimator, given by

$$\hat{\beta}_k = (X'X + kI)^{-1} X'y$$

$$\phantom{\hat{\beta}_k} = W_k X'y,$$


has the following five properties:

(1) For $k = 0$, $\hat{\beta}_k = \hat{\beta}$. That is, the OLS estimator is a special case of the ridge estimator.

(2) For $k > 0$, $\hat{\beta}_k'\hat{\beta}_k < \hat{\beta}'\hat{\beta}$. That is, $\hat{\beta}_k$ is shorter than $\hat{\beta}$.

(3) $\hat{\beta}_k$ is Normally distributed with mean $W_k X'X\beta$ and covariance matrix $\sigma^2 W_k X'X W_k$. Thus, if $k \neq 0$, $\hat{\beta}_k$ is a biased estimator since its mean is not equal to $\beta$.

(4) The MSE of $\hat{\beta}_k$ is

$$\mathrm{MSE}(\hat{\beta}_k) = \sigma^2 \sum_{i=1}^{p} [\lambda_i/(\lambda_i + k)^2] + k^2 \beta' W_k^2 \beta.$$

Because $\hat{\beta}_k$ is a biased estimator, its MSE includes a bias term in addition to the variance term.

(5) There always exists a value $k > 0$ such that the ridge estimator $\hat{\beta}_k$ has smaller MSE than the OLS estimator $\hat{\beta}$.


In other words, this last property states that although $\hat{\beta}_k$ is biased
for $k > 0$, there does exist a value of $k > 0$ such that the resulting
bias is more than offset by a reduction in variance. Thus, the corresponding
ridge estimator does provide a reduction in MSE relative to the
OLS estimator. The proof of this by Hoerl and Kennard [7] has been
the  basis  of  the  excitement  over  the  use  of  ridge  regression.  The 
statistical  literature  is  now  well-represented  by  articles  discussing 
various  aspects  of  ridge  regression,  for  example  [2,  4,  5,  9,  10,  11, 
12,  13,  14,  15,  16,  17]. 
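The trade-off between the variance and bias terms in property (4) can be sketched as follows. The sketch works in the eigenvector basis of $X'X$, where the bias term $k^2 \beta' W_k^2 \beta$ becomes $k^2 \sum \alpha_i^2/(\lambda_i + k)^2$ with $\alpha = V'\beta$ and $V$ the matrix of normalized eigenvectors. The eigenvalues, the vector `alpha`, and $\sigma^2 = 1$ are hypothetical values chosen only for illustration.

```python
def ridge_mse(k, eigenvalues, alpha, sigma2=1.0):
    """MSE of the ridge estimator for a fixed k, per property (4).

    alpha holds the components of beta in the eigenvector basis of X'X.
    """
    variance = sigma2 * sum(lam / (lam + k) ** 2 for lam in eigenvalues)
    bias_sq = k ** 2 * sum(a ** 2 / (lam + k) ** 2
                           for lam, a in zip(eigenvalues, alpha))
    return variance + bias_sq


eigenvalues = (1.98, 0.02)  # hypothetical, one eigenvalue close to zero
alpha = (1.0, 1.0)          # hypothetical true parameter (canonical form)

mse_ols = ridge_mse(0.0, eigenvalues, alpha)    # k = 0 recovers the OLS MSE
mse_ridge = ridge_mse(0.1, eigenvalues, alpha)  # one fixed k > 0
print(mse_ols, mse_ridge)
```

For these assumed values the fixed choice k = 0.1 gives a much smaller MSE than k = 0. Note that the comparison treats k as a constant and uses the true $\beta$ and $\sigma^2$, neither of which is known in practice.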


A.  A SIMPLE  EXAMPLE 

For illustrative purposes, consider the problem of estimating the
mean of a univariate Normal distribution. For this problem, the regression
model $y = X\beta + \varepsilon$ is, of course, appropriate. However, $\beta$ is a
1 x 1 vector (i.e., a scalar $\beta$ which denotes the population mean) and
the $X$ matrix is equal to $\mathbf{1}$, an n x 1 column vector of ones. Thus,
the ridge estimator of the mean $\beta$ is given by

$$\hat{\beta}_k = (\mathbf{1}'\mathbf{1} + kI_1)^{-1}\mathbf{1}'y = (n + k)^{-1}\sum y_i = n\bar{y}/(n + k).$$

From property (4) of the ridge estimator,

$$\mathrm{MSE}(\hat{\beta}_k) = n\sigma^2/(n + k)^2 + k^2\beta^2/(n + k)^2.$$


By differentiating with respect to k, it can be verified that the value $k = \sigma^2/\beta^2$ yields the minimum MSE.

Unfortunately, the optimum value of k involves not only the unknown variance $\sigma^2$, but also the unknown parameter $\beta$ which is to be estimated. This is also true for the general regression situation. Thus, the problem of how to choose the value of k must be considered.
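The minimizing value of k in the simple example can be checked numerically. The sketch below evaluates the MSE expression for the ridge estimator of a mean at a few values of k. The sample size, variance, and mean are hypothetical values chosen only for illustration.

```python
def mean_ridge_mse(k, n, sigma2, beta):
    """MSE of the ridge estimator n * ybar / (n + k) of a Normal mean beta."""
    return (n * sigma2 + k ** 2 * beta ** 2) / (n + k) ** 2


n, sigma2, beta = 10, 4.0, 1.0  # hypothetical values
k_opt = sigma2 / beta ** 2      # the minimizing value sigma^2 / beta^2

mse_at_zero = mean_ridge_mse(0.0, n, sigma2, beta)  # OLS (sample mean)
mse_at_opt = mean_ridge_mse(k_opt, n, sigma2, beta)
print(mse_at_zero, mse_at_opt)

# Values of k on either side of k_opt give a larger MSE.
for k in (k_opt - 0.5, k_opt + 0.5):
    print(k, mean_ridge_mse(k, n, sigma2, beta))
```

Substituting $k = \sigma^2/\beta^2$ into the MSE expression gives $\sigma^2/(n + k)$, which is smaller than the MSE $\sigma^2/n$ of the sample mean for any finite nonzero $\beta$.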

B.  CHOOSING  THE  VALUE  OF  k 

In their original papers [7,8], Hoerl and Kennard discuss the use
of a device they termed the "ridge trace", which is a plot of each
component of $\hat{\beta}_k$ against k. Their primary guideline is to choose a
value of k where the system stabilizes, that is, where the components
come together and more or less flatten out. However, relying on an
"eyeball" judgement like this means that my choice of k and your choice
of k may be completely different. In fact, your choice of k tomorrow
may be different from your choice today, even for the same problem.
Thus, there is no objective way to evaluate the performance of a ridge
estimator chosen in this manner.
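The values plotted in a ridge trace can be tabulated with a few lines of code. The following is a minimal sketch for two regressors only, solving $(X'X + kI)b = X'y$ by Cramer's rule. The matrix `xtx`, the vector `xty`, and the grid of k values are hypothetical and do not come from the papers cited above.

```python
def ridge_2x2(xtx, xty, k):
    """Solve (X'X + kI) b = X'y for two regressors by Cramer's rule."""
    a, b = xtx[0][0] + k, xtx[0][1]
    c, d = xtx[1][0], xtx[1][1] + k
    det = a * d - b * c
    return ((d * xty[0] - b * xty[1]) / det,
            (a * xty[1] - c * xty[0]) / det)


xtx = [[1.0, 0.98], [0.98, 1.0]]  # hypothetical X'X in correlation form
xty = [0.9, 0.7]                  # hypothetical X'y

# Each row is (k, first component, second component) of the ridge estimate.
trace = []
for k in (0.0, 0.01, 0.05, 0.1, 0.5, 1.0):
    b1, b2 = ridge_2x2(xtx, xty, k)
    trace.append((k, b1, b2))
    print(k, round(b1, 3), round(b2, 3))
```

Plotting the second and third columns against the first gives the trace. Where the curves "flatten out" remains a matter of judgement, which is the objection raised above.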

To overcome the objections to using a subjective estimate of k,
a number of people have suggested various estimators for k. For example,
Hoerl, Kennard, and Baldwin [9] proposed the estimator

$$\hat{k} = p\hat{\sigma}^2/\hat{\beta}'\hat{\beta}.$$

MacDonald and Galarneau [13] suggested an estimator which has the "correct
length." Since

$$E(\hat{\beta}'\hat{\beta}) = \beta'\beta + \sigma^2 \sum (1/\lambda_i),$$

the quantity

$$Q = \hat{\beta}'\hat{\beta} - \hat{\sigma}^2 \sum (1/\lambda_i)$$

is an unbiased estimator of $\beta'\beta$. Therefore, for $Q > 0$ MacDonald
and Galarneau suggested using the estimator $\hat{\beta}_k$, where k is chosen
such that

$$\hat{\beta}_k'\hat{\beta}_k - Q = 0.$$

For $Q < 0$, they suggested choosing k equal to some prespecified
constant $k_0$. Two specific choices are:

(1) $k_0 = 0$ (the OLS estimator)

(2) $k_0 = \infty$ ($\hat{\beta}_k = 0$).
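Both rules can be sketched in a few lines. The code below is a minimal illustration, not the authors' implementation. It works in the eigenvector basis of $X'X$, where the squared length of the ridge estimator is $\sum [\lambda_i \hat{\alpha}_i/(\lambda_i + k)]^2$ with $\hat{\alpha} = V'\hat{\beta}$, and it solves the length equation by bisection, which is an assumed numerical method. All function names and input values are hypothetical.

```python
def hkb_k(p, sigma2_hat, beta_hat):
    """Hoerl-Kennard-Baldwin rule: k = p * sigma2_hat / (beta_hat' beta_hat)."""
    return p * sigma2_hat / sum(b * b for b in beta_hat)


def ridge_sq_length(k, eigenvalues, alpha_hat):
    """Squared length of the ridge estimator in the eigenvector basis."""
    return sum((lam * a / (lam + k)) ** 2
               for lam, a in zip(eigenvalues, alpha_hat))


def mg_q(eigenvalues, alpha_hat, sigma2_hat):
    """Q = beta_hat' beta_hat - sigma2_hat * sum(1 / lambda_i)."""
    return (ridge_sq_length(0.0, eigenvalues, alpha_hat)
            - sigma2_hat * sum(1.0 / lam for lam in eigenvalues))


def mg_k(eigenvalues, alpha_hat, sigma2_hat, k0=0.0):
    """Choose k so the ridge estimator has squared length Q; fall back to k0."""
    q = mg_q(eigenvalues, alpha_hat, sigma2_hat)
    if q <= 0:  # no finite k gives squared length Q, so use the preset k0
        return k0
    lo, hi = 0.0, 1.0
    while ridge_sq_length(hi, eigenvalues, alpha_hat) > q:
        hi *= 2.0  # the length decreases in k, so this brackets the root
    for _ in range(200):  # plain bisection on the monotone length function
        mid = 0.5 * (lo + hi)
        if ridge_sq_length(mid, eigenvalues, alpha_hat) > q:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


eigenvalues = (1.98, 0.02)  # hypothetical eigenvalues of X'X
alpha_hat = (1.0, 3.0)      # hypothetical OLS estimate in canonical form
sigma2_hat = 0.1            # hypothetical variance estimate

k_hkb = hkb_k(len(alpha_hat), sigma2_hat, alpha_hat)
k_mg = mg_k(eigenvalues, alpha_hat, sigma2_hat)
print(k_hkb, k_mg)
```

Because $\hat{\beta}'\hat{\beta} = \hat{\alpha}'\hat{\alpha}$, the canonical coordinates can be passed directly to `hkb_k`. Both values of k depend on the data through $\hat{\beta}$ and $\hat{\sigma}^2$, so neither is a constant.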


Both sets of authors carried out simulations to evaluate the
performance of the proposed ridge estimators. In general, the ridge
estimators did well in some portions of the parameter space and not
too well in other portions.

It  should  be  noted  that  when  k is  not  a constant,  the  distributional 
and  MSE  properties  previously  listed  for  a ridge  estimator  do  not  hold, 
since  these  properties  are  conditional  on  k.  Therefore,  there  is  no 
guarantee  that  a value  of  k chosen  by  examination  of  the  ridge  trace 
or  by  application  of  some  rule  will  yield  an  MSE  which  is  smaller  than 
that  provided  by  the  OLS  estimator. 

As an aside, it should be noted that in a Bayesian framework if $\beta$
has a prior Normal distribution with mean $0$ and covariance matrix
$\sigma_0^2 I$, then the posterior mean of $\beta$ has the same form as a ridge estimator
with

$$k = \sigma^2/\sigma_0^2.$$

So,  in  a sense,  by  choosing  k from  the  data  rather  than  before  the 
data  is  taken,  ridge  regression  involves  an  a posteriori  selection  of 
the  prior  distribution. 
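The correspondence can be checked in the simplest case, the estimation of a Normal mean with known variance. The sketch below compares the ridge estimator $n\bar{y}/(n + k)$ with the standard conjugate-Normal posterior mean for a prior centered at zero. The data and the two variances are hypothetical values chosen only for illustration.

```python
def ridge_mean(y, k):
    """Ridge estimator of a mean: sum(y) / (n + k)."""
    return sum(y) / (len(y) + k)


def posterior_mean(y, sigma2, prior_var):
    """Posterior mean of a Normal mean under a N(0, prior_var) prior."""
    precision = len(y) / sigma2 + 1.0 / prior_var  # posterior precision
    return (sum(y) / sigma2) / precision


y = [2.1, 1.7, 2.4, 1.9]        # hypothetical observations
sigma2, prior_var = 0.5, 2.0    # hypothetical known variance and prior variance

k = sigma2 / prior_var          # k = sigma^2 / sigma_0^2
print(ridge_mean(y, k), posterior_mean(y, sigma2, prior_var))
```

The two printed values agree. A small prior variance (strong belief that the mean is near zero) corresponds to a large k, and a diffuse prior corresponds to k near zero.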

C.  TRANSFORMATIONS 

With the usual regression model which includes a constant term,
a problem may be considered in a number of different forms. For
example, the original independent variables $X_{ij}$, centered independent
variables $(X_{ij} - \bar{X}_j)$, or standardized independent variables

$$(X_{ij} - \bar{X}_j) / [\sum_i (X_{ij} - \bar{X}_j)^2]^{1/2}$$

may be used. In addition, the dependent variable may be centered
or standardized.

In any event, all of these transformations result in the same
least squares estimator $\hat{\beta}$ once the estimates are expressed in the
original units. Usually the standardized form is used in
calculations, since it is less susceptible to round-off error. In
this situation the $X'X$ matrix is given in correlation form. Most
papers devoted to ridge regression have also assumed $X'X$ in correlation
form, but this is not required by any of the underlying theory. However,
unlike the least squares situation, ridge regression will produce
different results for each transformation on the independent variables.
That is, a ridge estimator is not invariant to the form of the model.
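The lack of invariance can be seen even with a single regressor and no constant term, where the ridge slope is $\sum x_i y_i/(\sum x_i^2 + k)$. The sketch below rescales the regressor by a constant and compares fitted values. The data, the scale factor, and the value of k are hypothetical.

```python
def ridge_slope(x, y, k=0.0):
    """Ridge slope for one regressor without intercept; k = 0 gives OLS."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + k)


def fitted(x, y, k=0.0):
    """Fitted values from the ridge (or OLS) slope."""
    slope = ridge_slope(x, y, k)
    return [slope * xi for xi in x]


x = [1.0, 2.0, 3.0, 4.0]            # hypothetical regressor
y = [1.1, 1.9, 3.2, 3.9]            # hypothetical response
x_scaled = [10.0 * xi for xi in x]  # same regressor in different units

ols_raw, ols_scaled = fitted(x, y), fitted(x_scaled, y)
ridge_raw, ridge_scaled = fitted(x, y, k=1.0), fitted(x_scaled, y, k=1.0)
print(ols_raw[-1], ols_scaled[-1])      # OLS fits agree
print(ridge_raw[-1], ridge_scaled[-1])  # ridge fits differ
```

With the same k, changing the units of the regressor changes the ridge fit but leaves the least squares fit untouched, so the form of the model must be fixed before k has any meaning.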
III. THE PURPOSE OF A REGRESSION INVESTIGATION


Not all regression investigations have the same purpose. Four
possible purposes are:

(1) Estimation of $\beta$

(2) Prediction of $y$ (i.e., estimation of $x_0'\beta$)

(3)  Hypothesis  Testing 

(4)  Optimization  and  Control. 

For  (4)  the  warning  of  Box  [1]  should  be  recalled.  He  pointed  out  that 
"...  to  find  out  what  happens  to  a system  when  you  interfere  with  it 
you  have  to  interfere  with  it  (not  just  passively  observe  it)."  So,  if 
at  all  possible,  multicollinearity  should  be  avoided  by  means  of  a well- 
designed  experiment.  Otherwise,  the  experimenter  may  be  out  on  a limb. 

For  (3)  the  experimenter  is  left  floundering  without  adequate  distribution 
theory  if  a ridge  estimator  is  used  instead  of  the  OLS  estimator. 

It should be noted that even if the OLS estimator does not result
in a very accurate estimate of $\beta$ [purpose (1)], this does not necessarily
mean that it will do badly in predicting $y$ [purpose (2)], as the following
example illustrates.

Figure 1 summarizes a regression problem discussed by Marquardt [10].
Because the regression model includes two independent variables but no
constant term, a two-dimensional plot is adequate for displaying the
problem. As can be seen from the figure, the variables $X_1$ and $X_2$ are
highly correlated.

Figure 1: Marquardt's Regression Example

In general, an $X'X$ matrix may be expressed as

$$X'X = \sum_{i=1}^{p} \lambda_i v_i v_i'$$

where

$\lambda_1 \geq \ldots \geq \lambda_p > 0$ are eigenvalues of $X'X$

and

$v_1, \ldots, v_p$ are the corresponding normalized eigenvectors.

This representation of $X'X$ indicates how well the data space is
covered, while a similar representation of $(X'X)^{-1}$ indicates how well
the parameter space is covered. As indicated in Figure 2, for this
example 99% (1.98/2.00) of the variability in the data space is along
the line $X_1 = X_2$, while $V(\hat{\beta}_1 + \hat{\beta}_2)$ is approximately 1% (0.51/50.51)
of the total variance in the parameter space.

It should be noted that although $V(\hat{\beta}_1)$ and $V(\hat{\beta}_2)$ are large
($25.253\sigma^2$), $V(\hat{y}) < .750\sigma^2$ for any predicted mean value within region
I, the region defined by the data. Even some distance from this data
region, things aren't too bad along the first principal component
axis [for example, at point A $V(\hat{y}_A) = 1.01\sigma^2$], but aren't too good
along the other axis [for example, at point B $V(\hat{y}_B) = 19.12\sigma^2$].
Points A and B are equidistant from the center of the data region I.
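Several of the quoted figures can be reproduced under one assumption that is inferred rather than stated in the text: that $X'X$ is in correlation form with correlation r = 0.98, which gives eigenvalues 1.98 and 0.02 summing to 2.00. The sketch below uses the closed-form inverse of a 2 x 2 correlation matrix. The coordinates (1, 1) used for the prediction are likewise an assumption, chosen because they lie on the line $X_1 = X_2$.

```python
r = 0.98                         # assumed correlation between X1 and X2
lam1, lam2 = 1.0 + r, 1.0 - r    # eigenvalues of [[1, r], [r, 1]]

data_share = lam1 / (lam1 + lam2)             # share of data-space variability
param_total = 1.0 / lam1 + 1.0 / lam2         # total variance / sigma^2
param_share = (1.0 / lam1) / param_total      # share along the first axis
var_beta = 1.0 / (1.0 - r * r)                # V(beta_hat_i) / sigma^2


def prediction_variance(x1, x2, r):
    """x'(X'X)^{-1}x in units of sigma^2 for X'X = [[1, r], [r, 1]]."""
    return (x1 * x1 - 2.0 * r * x1 * x2 + x2 * x2) / (1.0 - r * r)


print(data_share, param_total, param_share, var_beta)
print(prediction_variance(1.0, 1.0, r))  # a point on the line X1 = X2
```

Under this assumption the shares come out near 99% and 1%, the coefficient variances near $25.25\sigma^2$, and the prediction variance at (1, 1) near $1.01\sigma^2$, in line with the values quoted above.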

The moral of this is that if the concern is with predicting values
of $y$ within the region covered by the data, the results of using the
OLS estimator are reasonable even if the estimate of $\beta$ is not very
accurate. In general, multicollinearity does not prevent good predictions
of mean responses or of new observations, so long as these
predictions are made for points within the data region. Of course,
for points outside this region it must be realized that extrapolation
may be dangerous regardless of whether an OLS estimator or a ridge
estimator is used.

Figure 2: Coverage of the Data Space and Parameter Space


IV.  SOME  QUESTIONS 


This  report  began  with  the  question  "Is  ridge  regression  a 
panacea?"  and  will  end  with  a number  of  other  questions. 

(1) Is $y = X\beta + \varepsilon$ the true model or an approximation?

Most research establishing the usefulness of ridge estimators
is based on the explicit or implicit assumption that $X\beta$ is the true
model, even outside the data region. In many cases this may not be a
reasonable assumption.

(2) Is the OLS estimator deficient or is the data deficient?

There are two ways of viewing the results in a situation where
there is a high degree of multicollinearity. The first is that the
OLS estimator gives bad results. The second is that the data is
inadequate for the estimation task at hand.

(3) Is there a structural relationship in the data?

If, for example, flow rate is always reduced as temperature is
increased because of physical constraints, a high degree of collinearity
will be present. In this type of situation, it might be well to
incorporate this relationship directly into the statistical analysis.

(4)  Can  more  data  be  obtained? 

If predictions are to be made outside of the data region and if
data  can  be  observed  there,  the  best  bet  may  be  to  observe  data  there 
before  attempting  any  estimation. 

(5) Assuming that a ridge estimator may be useful sometimes, why
not always use it?

It should be recalled that regardless of the degree of multicollinearity
(even if there is complete orthogonality), there always exists a value
of $k > 0$ such that $\mathrm{MSE}(\hat{\beta}_k) < \mathrm{MSE}(\hat{\beta})$.

(6)  What  about  the  lack  of  invariance? 

If an experimenter does decide to use ridge regression, he or she
must then consider the invariance problem, i.e., the form of the regression
model to be used in actually carrying out the estimation process.
V. CONCLUSION


The  potential  user  of  ridge  regression  would  be  well-advised  to 
consider  seriously  the  questions  listed  in  the  previous  section.  In 
addition,  he  or  she  must  remember  that  the  use  of  a ridge  estimator 
involves  a trade-off  between  the  chances  of  gains  in  certain  regions 
of  the  parameter  space  and  the  chances  of  losses  in  certain  other 
regions. 

Although  ridge  regression  may  offer  promise,  its  use  as  a routine 
analysis  method  is  not  without  severe  shortcomings.  Therefore,  the 
question  in  the  title  of  this  report  is  answered  in  the  negative. 

Ridge regression has been offered as an alternative to ordinary least squares for estimating
$\beta$, particularly in those situations where "severe" multicollinearity exists in
$X$. Ridge regression involves the use of a ridge estimator, which takes the
form $\hat{\beta}(k) = (X'X + kI)^{-1}X'y$ where $k > 0$.

The properties of ridge regression relative to those of ordinary least
squares are discussed. Although ridge regression does appear to offer promise,
its use as a routine analysis method is not without shortcomings. Therefore,
the question in the title is answered in the negative.